12/01/2019
Goal:
Want objects clustered together to be more similar than objects in different clusters
The catch:
We do not know the true structure!
Key difference to supervised learning
Some set of points, \(x_i\), where \(i = 1, \dots, N\)
Each of these points has attributes indexed by \(j\), where \(j = 1, \dots, p\)
Example dataset: Iris
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Define a dissimilarity between each attribute: \[d_j(x_{ij}, x_{i'j})\] Define a dissimilarity between points: \[ D(x_i, x_{i'})= \sum_{j=1}^p d_j (x_{ij}, x_{i'j}) \] A common choice is squared Euclidean distance: \[D(x_i, x_{i'}) = \sum_{j=1}^p d_j(x_{ij},x_{i'j}) = \sum_{j=1}^p (x_{ij} - x_{i'j})^2\]
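The squared Euclidean dissimilarity above can be sketched directly in base R (an illustrative example, not part of the original notes), using the iris attributes:

```r
# A minimal sketch: squared Euclidean dissimilarities between all pairs of points.
X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X))^2   # dist() gives Euclidean distance; square it

# Check one pair against the definition: D(x_1, x_2) = sum_j (x_{1j} - x_{2j})^2
sum((X[1, ] - X[2, ])^2)
```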
\[ D(x_i, x_{i'}) = \sum_{j=1}^p w_j \cdot d_j(x_{ij}, x_{i'j}); \quad \sum_{j=1}^p w_j = 1 \]
setting \(w_j\) to \(\tfrac{1}{p}\) does not necessarily mean all attributes contribute equally
an alternative is to weight each attribute by the inverse of its overall average dissimilarity, so let \(w_j = \frac{1}{\bar{d}_j}\) where
\[ \bar{d}_j = \frac{1}{N^2}\sum^N_{i=1}\sum^N_{i'=1}d_j(x_{ij}, x_{i'j}) \]
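A sketch of these weights in base R (an assumed example, not from the notes), taking \(d_j\) to be the squared difference:

```r
# Inverse-average-dissimilarity weights w_j = 1 / d_bar_j for the iris attributes.
X <- as.matrix(iris[, 1:4])
N <- nrow(X)

# d_bar_j: average of (x_ij - x_i'j)^2 over all N^2 pairs. For squared differences
# this equals twice the biased variance of attribute j.
d_bar <- apply(X, 2, function(xj) sum(outer(xj, xj, function(a, b) (a - b)^2)) / N^2)
w <- 1 / d_bar
```

With these weights, attributes with large average dissimilarity (large spread) are down-weighted rather than dominating \(D(x_i, x_{i'})\).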
Different types of dissimilarities:
Dissimilarities are not necessarily distances
combinatorial - work on the data with no reference to an underlying probability model
mixture models - assume data is iid sampled from some population, each cluster corresponds to one component of the mixture density
mode seekers (bump hunters) - non-parametric approach, attempting to estimate modes of the distribution function
Some set of points, \(x_i\), where \(i = 1, \dots, N\)
Prespecify a number of clusters \(K < N\)
Index each cluster by \(k \in \{1, \dots, K\}\)
Each observation can belong to exactly one cluster
Define an encoder function, \(k = C(i)\), so \(x_i\) is assigned label \(k\)
Assume there is some 'optimal' encoder \(C^*(i)\) that captures the true underlying structure
"Total" point scatter = "Within" point scatter + "Between" point scatter
\[ \begin{aligned} T(C) &= \frac{1}{2} \sum_{i=1}^N\sum_{i'= 1}^N D(x_i, x_{i'}) \\ &= \frac{1}{2}\sum_{k=1}^K\sum_{C(i) = k}\left( \sum_{C(i') =k} D(x_i, x_{i'}) + \sum_{C(i') \neq k} D(x_i, x_{i'}) \right) \\ &= W(C) + B(C) \end{aligned}\]
Minimise \(W(C)\)
Maximise \(B(C)\)
Can't evaluate this for all possible encoders, so we proceed greedily
Squared Euclidean distance
Minimise \(W(C)\)
\[ \begin{aligned} W(C) &= \frac{1}{2}\sum_{k=1}^K\sum_{C(i) = k}\sum_{C(i')=k} \|x_i - x_{i'}\|^2 \\ &=\sum_{k=1}^K N_k \sum_{C(i) = k}\|x_i - \bar{x}_k \|^2, \end{aligned}\] where \(N_k\) is the number of points in cluster \(k\) and \(\bar{x}_k\) is that cluster's mean vector.
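Both identities, \(T(C) = W(C) + B(C)\) and the centroid form of \(W(C)\), can be checked numerically; a sketch using iris and an arbitrary k-means labelling (the labelling itself is just for illustration):

```r
X <- as.matrix(iris[, 1:4])
set.seed(1)
cl <- kmeans(X, centers = 3, nstart = 10)$cluster
D <- as.matrix(dist(X))^2                      # squared Euclidean dissimilarities

T_scatter <- sum(D) / 2                        # total point scatter
W_pair <- sum(sapply(unique(cl), function(k) sum(D[cl == k, cl == k]) / 2))
B_pair <- sum(sapply(unique(cl), function(k) sum(D[cl == k, cl != k]) / 2))

# Centroid form: W(C) = sum_k N_k * sum_{C(i)=k} ||x_i - xbar_k||^2
W_centroid <- sum(sapply(unique(cl), function(k) {
  Xk <- X[cl == k, , drop = FALSE]
  nrow(Xk) * sum(sweep(Xk, 2, colMeans(Xk))^2)
}))
```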
Note:
Result will depend on our initial point assignment
Algorithm will converge to a local minimum, but not necessarily the global minimum.
http://rpubs.com/Nitika/kmeans_Iris
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris.new <- iris[, c(1, 2, 3, 4)]
iris.class <- iris[, "Species"]
summary(iris.new)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
normalize <- function(x){return ((x-min(x))/(max(x)-min(x)))}
iris.new$Sepal.Length <- normalize(iris.new$Sepal.Length)
iris.new$Sepal.Width <- normalize(iris.new$Sepal.Width)
iris.new$Petal.Length <- normalize(iris.new$Petal.Length)
iris.new$Petal.Width <- normalize(iris.new$Petal.Width)
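The four assignments above can also be collapsed into one step (a sketch, not part of the original tutorial), applying the same min-max `normalize()` to every attribute column at once:

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# lapply() normalizes each column; as.data.frame() restores the data-frame shape
iris.new <- as.data.frame(lapply(iris[, 1:4], normalize))
```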
result <- kmeans(x = iris.new, centers = 3, nstart = 10)
result$size
## [1] 50 61 39
result$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.1961111   0.5950000   0.07830508  0.06083333
## 2    0.4412568   0.3073770   0.57571548  0.54918033
## 3    0.7072650   0.4508547   0.79704476  0.82478632
result$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
## [106] 3 2 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3
## [141] 3 3 2 3 3 3 2 3 3 2
Total number of correctly classified instances are:
36 + 47 + 50 = 133
Total number of incorrectly classified instances are:
3 + 14 = 17
Accuracy = 133/(133 + 17) ≈ 0.887, i.e. the model achieves about 88% accuracy
##    iris.class
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         47        14
##   3      0          3        36
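The confusion table and the accuracy figure can be reproduced with `table()`; a sketch (the `set.seed()` call is our addition for reproducibility, and cluster labels are arbitrary, so rows may come out permuted on another run):

```r
iris.class <- iris[, "Species"]
iris.new <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
set.seed(1)
result <- kmeans(iris.new, centers = 3, nstart = 10)
tab <- table(result$cluster, iris.class)

# Best-match accuracy: count each cluster's majority class as "correct"
accuracy <- sum(apply(tab, 1, max)) / sum(tab)
```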
Lots of methods for choosing the number of clusters - check out the NbClust package
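One simple base-R heuristic (not NbClust itself) is the "elbow" plot: total within-cluster sum of squares against \(K\), looking for the kink where adding clusters stops paying off:

```r
# Elbow heuristic sketch: tot.withinss for K = 1..8 on scaled iris attributes
X <- scale(iris[, 1:4])
wss <- sapply(1:8, function(K) kmeans(X, centers = K, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```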
http://rpubs.com/Nitika/kmeans_Airquality (What not to do!!!)
data("airquality")
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
#Normalize the dataset
normalize<- function(x){
return((x-min(x))/(max(x)-min(x)))
}
# replace contents of dataset with normalized values
airquality <- normalize(airquality)
# apply k-means algorithm using first 4 attributes and with k=3
result<- kmeans(airquality[c(1,2,3,4)], 3)
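This is the "what not to do": airquality contains missing values, so `min()` and `max()` return NA, every "normalised" value becomes NA, and `kmeans()` fails. A sketch of one repair (the `na.omit()` step and per-column normalisation are our additions, not part of the linked tutorial):

```r
data("airquality")
anyNA(airquality)   # TRUE: Ozone and Solar.R contain NAs, so normalize() yields all NAs

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
aq <- na.omit(airquality)                               # drop incomplete rows first
aq.norm <- as.data.frame(lapply(aq[, 1:4], normalize))  # normalise each column separately
result <- kmeans(aq.norm, centers = 3, nstart = 10)
```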
K-means not robust to outliers
Method is commonly phrased using squared Euclidean distance
Optimisation requires a notion of a mean/centroid
Standard implementation is in terms of points not distances
Optimise with respect to one of the points, instead of the mean
More robust to outliers
Implementation is in terms of either distances or points
Randomly select an initial set of \(K\) points. These are the initial medoids, \(m_k\).
Assign each point, \(x_i\), to its closest medoid, \(m_k\).
For each cluster, update the medoid. The new medoid is the point within that cluster that minimises the sum of within-cluster distances \[ m_k = \mathop{\mathrm{argmin}}\limits_{x_i \in C_k} \sum_{C(i') =k} D(x_i, x_{i'}). \]
Repeat the assignment and update steps until the medoids are unchanged.
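The loop above can be sketched in base R working purely from a distance matrix (illustrative only; `pam()` from the cluster package is the standard implementation):

```r
kmedoids_sketch <- function(D, K, max_iter = 100) {
  N <- nrow(D)
  medoids <- sample(N, K)                        # step 1: random initial medoids
  for (iter in 1:max_iter) {
    cl <- apply(D[, medoids, drop = FALSE], 1, which.min)   # step 2: closest medoid
    new_medoids <- sapply(1:K, function(k) {     # step 3: minimise within-cluster distances
      idx <- which(cl == k)
      if (length(idx) == 0) return(medoids[k])   # keep old medoid if a cluster empties
      idx[which.min(rowSums(D[idx, idx, drop = FALSE]))]
    })
    if (all(sort(new_medoids) == sort(medoids))) break      # stop when medoids unchanged
    medoids <- new_medoids
  }
  list(medoids = medoids, clustering = cl)
}

set.seed(1)
res <- kmedoids_sketch(as.matrix(dist(iris[, 1:4])), K = 3)
```

Note this only needs `D`, never the raw points, which is exactly the "distances or points" flexibility mentioned above.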
library(cluster)
result_kmedoids_1 <- pam(x = iris.new, k = 3, diss = FALSE)
result_kmedoids_2 <- pam(x = dist(iris.new), k = 3, diss = TRUE)
result_kmedoids_3 <- pam(x = iris.new, k = 3,
diss = FALSE, metric = "euclidean")
Let \(m_k\) be the medoid of cluster with label \(k\).
For a point \(x_i\), \(m_k\) is its closest medoid.
Define \(m_{k'}\) to be the next closest medoid.
Silhouette coefficient: \[ s_i(k) = 1 - \dfrac{D(m_k,x_i)}{D(m_{k'}, x_i)} \] Average silhouette coefficient: \[ \bar{s}(K) = \dfrac{1}{N} \sum_{k = 1}^K \sum_{C(i) = k} s_i(k) \]
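This simplified (medoid-based) silhouette can be computed directly from the point-to-medoid distances; a sketch using `pam()` medoids and Euclidean distance (the distance choice is our assumption):

```r
library(cluster)  # for pam()

iris.new <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
fit <- pam(iris.new, k = 3)

# Distance from each point to each medoid
X <- as.matrix(iris.new)
M <- fit$medoids
d_to_med <- sapply(1:nrow(M), function(k) sqrt(colSums((t(X) - M[k, ])^2)))

# s_i = 1 - D(closest medoid, x_i) / D(second-closest medoid, x_i)
d_sorted <- t(apply(d_to_med, 1, sort))
s <- 1 - d_sorted[, 1] / d_sorted[, 2]
mean(s)   # average silhouette coefficient for K = 3
```

Larger \(\bar{s}(K)\) means points sit much closer to their own medoid than to the runner-up, so comparing \(\bar{s}(K)\) across candidate \(K\) is one way to pick the number of clusters.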
Consider \(\max\{D(x_i,x_j), 2\}\) as the clustering distance.
still have to pick \(K\)
sensitive to point density
again, optimisation makes implicit assumptions about distance
greedy
slow for large numbers of points
Don't have to pick \(K\)
There is an implicit ordering, or hierarchy, within the clustering
hc <- hclust(dist(iris.new), method = "average")
plot(hc, hang = -1, labels = FALSE)
clust1 <- cutree(hc, k = 3)
Single: \[D(C_k, C_{k'}) = \min \{D(x_k, x_{k'}) : x_k \in C_k, x_{k'} \in C_{k'} \},\] also known as nearest neighbour; it is the minimum pairwise distance.
Complete: \[ D(C_k, C_{k'}) = \max \{D(x_k, x_{k'}) : x_k \in C_k, x_{k'} \in C_{k'} \}, \] also known as furthest neighbour; it is the maximum pairwise distance.
Group Average: \[ D(C_k, C_{k'}) = \frac{1}{|C_k|\,|C_{k'}|} \sum_{x_k \in C_k} \sum_{x_{k'} \in C_{k'}} D(x_k, x_{k'}), \] also known as the unweighted pair group method with arithmetic mean (UPGMA).
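The three linkage rules can be compared on the same data (a sketch); on iris the final merge height is smallest for single linkage and largest for complete linkage, reflecting how conservatively each rule merges:

```r
d <- dist(iris[, 1:4])
# Maximum merge height under each linkage rule
heights <- sapply(c("single", "average", "complete"),
                  function(m) max(hclust(d, method = m)$height))
heights
```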
To generate the agglomerative dendrogram:
Start with each point as its own cluster.
Merge the two closest clusters under the chosen linkage, recording the merge height.
Repeat until a single cluster remains.
Covered K-means, K-medoids and Hierarchical Clustering
Covered worked example in base R
Looked at strengths and limitations
Touched on cluster selection